arXiv: 1502.07439v4 [cs.SI] 15 Feb 2016 


When Social Influence Meets Item Inference 


Hui-Ju Hung*, Hong-Han Shuai 1 , De-Nian Yang 1 , Liang-Hao Huang 1 ,
Wang-Chien Lee*, Jian Pei § , Ming-Syan Chen 1
*The Pennsylvania State University, State College, Pennsylvania, USA
Academia Sinica, Taipei, Taiwan 
§ Simon Fraser University, Burnaby, Canada 
National Taiwan University, Taipei, Taiwan 


ABSTRACT 

Research issues and data mining techniques for product
recommendation and viral marketing have been widely studied.
Existing works on seed selection in social networks do not take
into account the effect of product recommendations in e-commerce
stores. In this paper, we investigate the seed selection problem
for viral marketing that considers both effects of social
influence and item inference (for product recommendation). We
develop a new model, Social Item Graph (SIG), that captures both
effects in the form of hyperedges. Accordingly, we formulate a seed
selection problem, called Social Item Maximization Problem (SIMP),
and prove the hardness of SIMP. We design an efficient algorithm
with performance guarantee, called Hyperedge-Aware Greedy (HAG),
for SIMP and develop a new index structure, called SIG-index, to
accelerate the computation of the diffusion process in HAG.
Moreover, to construct realistic SIG models for SIMP, we develop a
statistical inference based framework to learn the weights of
hyperedges from data. Finally, we perform a comprehensive
evaluation of our proposals against various baselines. Experimental
results validate our ideas and demonstrate the effectiveness and
efficiency of the proposed model and algorithms over baselines.

1. INTRODUCTION 

The ripple effect of social influence has been explored for viral
marketing via online social networks. Indeed, studies show that
customers tend to receive product information from friends better
than advertisements on traditional media [18]. To explore the
potential impact of social influence, many research studies on seed
selection, i.e., selecting a given number of influential customers
to maximize the spread of social recommendation for a product, have
been reported. However, these works do not take into account the
effect of product recommendations in online e-commerce stores. We
argue that when a customer buys an item due to social influence
(e.g., via Facebook or Pinterest), there is a potential side effect
due to the item inference recommendations from stores.² For example,
when Alice buys a DVD of “Star Wars” due to the recommendation from
friends, she may also pick up the original novel of the movie due
to an in-store recommendation, which may in turn trigger additional
purchases of the novel among her friends. To the best of our
knowledge, this additional spread introduced by the item inference
recommendations has not been considered in existing research on
viral marketing.

1 All the top 5 online retailers, including Amazon, Staples,
Apple, Walmart, and Dell, are equipped with sophisticated
recommendation engines. They also support viral marketing
by allowing users to share favorite products on Facebook.

Figure 1: A motivating example

Figure 1 illustrates the above joint effects in a toy example with
two products and four customers, where a dashed arrow represents
the association rule behind the item inference recommendation, and
a solid arrow denotes the social influence between two friends upon
a product. In the two separate planes corresponding to the DVD and
the novel, social influence is expected to take effect on promoting
interest in (and potential purchases of) the DVD and the novel,
respectively. Meanwhile, the item inference recommendation by the
e-commerce store is expected to trigger sales of additional items.
In the example, when Bob buys the DVD, he may also buy the novel
due to the item inference recommendation. Moreover, he may
influence Cindy to purchase the novel. However, the association
rules behind item inference are derived without considering the
ripple effect of social influence. On the other hand, to promote
the movie DVD, Alice may be selected as a seed for a viral
marketing campaign, hoping to spread her influence to Bob and David
to trigger additional purchases of the DVD. Actually, due to the
effect of item inference recommendation, having Alice as a seed may
additionally trigger purchases of the novel by Bob and Cindy. This
is a factor that existing seed selection algorithms for viral
marketing do not account for.

2 In this paper, we refer to product/item recommendation
based on associations among items inferred from purchase
transactions as item inference recommendation.

We argue that to select seeds for maximizing the spread of product
information to a customer base (or maximizing the sales revenue of
products) in a viral marketing campaign, both effects of item
inference and social influence need to be considered. To
incorporate both effects, we propose a new model, called Social
Item Graph (SIG), in the form of hyperedges, for capturing
“purchase actions” of customers on products and their potential
influence to trigger other purchase actions. Different from the
conventional approaches that use links between customers to model
social relationships (for viral marketing) and links between items
to capture associations (for item inference recommendation), SIG
represents a purchase action as a node (denoted by a tuple of a
customer and an item), while using hyperedges among nodes to
capture the influence spread process used to predict customers’
future purchases. Unlike the previous influence propagation models,
which consist of only one kind of edge connecting two customers (in
social influence), the hyperedges in our model span across tuples
of different customers and items, capturing both effects of social
influence and item inference.

Based on SIG, we formulate the Social Item Maximization Problem
(SIMP) to find a seed set, which consists of selected products
along with targeted customers, to maximize the total adoptions of
products by customers. Note that SIMP takes multiple products into
consideration and targets maximizing the number of products
purchased by customers.³ SIMP is a very challenging problem that
does not have the submodularity property. We prove that SIMP cannot
be approximated within $n^c$ for any $c < 1$, where n is the number
of nodes in SIMP, i.e., SIMP is extremely difficult to approximate
with a small ratio because the best approximation ratio is almost
n.⁴

To tackle SIMP, two challenges arise: 1) numerous combinations of
possible seed nodes, and 2) expensive on-line computation of
influence diffusion upon hyperedges. To address the first issue, we
first introduce the Hyperedge-Aware Greedy (HAG) algorithm, based
on a unique property of hyperedges, i.e., a hyperedge requires all
its source nodes to be activated in order to trigger the purchase
action in its destination node. HAG selects multiple seeds in each
seed selection iteration to further activate more nodes via
hyperedges.⁵ To address the second issue, we exploit the structure
of the Frequent Pattern Tree (FP-tree) to develop SIG-index as a
compact representation of SIG in order to accelerate the
computation of activation probabilities of nodes in online
diffusion.

Moreover, to construct realistic SIG models for SIMP, we also
develop a statistical inference based framework to learn the
weights of hyperedges from logs of purchase actions. Identifying
the hyperedges and estimating the corresponding weights are major
challenges in constructing an SIG due to data sparsity and
unobservable activations. To address these issues, we propose a
novel framework that employs the smoothed expectation and
maximization algorithm (EMS) to identify hyperedges and estimate
their values by kernel smoothing.

3 SIMP can be extended to a weighted version with different
profits from each product. In this paper, we focus on
maximizing the total sales.

4 While there is no good solution quality guarantee for the
worst case scenario, we empirically show that the algorithm
we developed achieves total adoptions on average comparable
to optimal results.

5 A hyperedge requires all its source nodes to be activated
to diffuse its influence to its destination node.

The contributions of this paper are summarized as follows.

• We observe the deficiencies in existing techniques for
item inference recommendation and seed selection and
propose the Social Item Graph (SIG) that captures
both effects of social influence and item inference in
the prediction of potential purchase actions.

• Based on SIG, we formulate a new problem, called
Social Item Maximization Problem (SIMP), to select
the seed nodes for viral marketing that effectively
facilitate the recommendations from both friends and
stores simultaneously. In addition, we analyze the
hardness of SIMP.

• We design an efficient algorithm with performance
guarantee, called Hyperedge-Aware Greedy (HAG), and
develop a new index structure, called SIG-index, to
accelerate the computation of the diffusion process in HAG.

• To construct realistic SIG models for SIMP, we develop
a statistical inference based framework to learn the
weights of hyperedges from data.

• We conduct a comprehensive evaluation of our proposals
against various baselines. Experimental results validate
our ideas and demonstrate the effectiveness and
efficiency of the proposed model and algorithms over
baselines.

The rest of this paper is organized as follows. Section 2 reviews
the related work. Section 3 details the SIG model and its influence
diffusion process. Section 4 formulates SIMP and designs new
algorithms to efficiently solve the problem. Section 5 describes
our approach to construct the SIG. Section 6 reports our experiment
results and Section 7 concludes the paper.

2. RELATED WORK 

To discover the associations among purchased items, frequent
pattern mining algorithms find items which frequently appear
together in transactions [2]. Some variants, such as closed
frequent pattern mining and maximal frequent pattern mining, have
been studied. However, those existing works, focusing on unveiling
the common shopping behaviors of individuals, disregard the social
influence between customers. On the other hand, it has been pointed
out that items recommended by item inference may have been
introduced to users by social diffusion [25]. In this work, we
develop a new model and a learning framework that consider both the
social influence and item inference factors jointly to derive the
association among purchase actions of customers. In addition, we
focus on seed selection for prevalent viral marketing by
incorporating the effect of item inference.

With great potential in business applications, social influence
diffusion in social networks has attracted extensive interest
recently. Learning algorithms for estimating the social influence
strength between social customers have been developed. Based on
models of social influence diffusion, identifying the most
influential customers (seed selection) is a widely studied problem
[6] [16]. Precisely, those studies aim to find the best k initial
seed customers to target in order to maximize the population of
potential customers who may adopt the new product. This seed
selection problem has been proved to be NP-hard [16]. Based on two
influence diffusion models, Independent Cascade (IC) and Linear
Threshold (LT), Kempe et al. propose a (1 - 1/e)-approximation
greedy algorithm by exploring the submodularity property under IC
and LT [16]. Some follow-up studies focus on improving the
efficiency of the greedy algorithm using various spread estimation
methods, e.g., MIA [6] and TIM+ [23]. However, without considering
the existence of item inference, those algorithms are not
applicable to SIMP. Besides the IC and LT models, Markov random
fields have been used to model social influence and calculate
expected profits from viral marketing [8]. Recently, Tang et al.
proposed a Markov model based on “confluence”, which estimates the
total influence by combining different sources of conformity [22].
However, these studies only consider the diffusion of a single item
in business applications. Instead, we incorporate item inference in
spread maximization to estimate the influence more accurately.

Figure 2: A hyperedge example

3. SOCIAL ITEM GRAPH MODEL 

Here we first present the social item graph model and then 
introduce the diffusion process in the proposed model. 

3.1 Social Item Graph 

We aim to model user purchases and the potential activations of new
purchase actions by some prior purchase actions. We first define
the notions of the social network and purchase actions.

Definition 1. A social network is denoted by a directed
graph $G = (V, E)$ where V contains all the nodes and E
contains all the directed edges in the graph. Accordingly, a
social network is also referred to as a social graph.

Definition 2. Given a list of commodity items I and
a set of customers V, a purchase action (or purchase for
short), denoted by $(v, i)$ where $v \in V$ is a customer and
$i \in I$ is an item, refers to the purchase of item i by
customer v.

Definition 3. A purchase log is a database consisting
of all the purchase actions in a given period of time.

Association-rule mining (called item inference in this paper) has
been widely exploited to discover correlations between purchases in
transactions. For example, the rule {hotdog, bread} → {pickle}
obtained from the transactions of a supermarket indicates that if a
customer buys hotdogs and bread together, she is likely to buy
pickles. To model the above likelihood, the confidence of a rule
{hotdog, bread} → {pickle} is the proportion of the transactions
containing hotdogs and bread that also include pickles. It has been
regarded as the conditional probability that a customer buying both
hotdogs and bread would trigger the additional purchase of pickles.
To model the above rule in a graph, a possible way is to use two
separate edges (see Figure 5; one from hotdog to pickle, and the
other from bread to pickle), while the probability associated with
each of these edges is the confidence of the rule. In the above
graph model, however, either the hotdog or the bread alone may
trigger the purchase of pickles. This does not accurately express
the intended condition of purchasing both the hotdog and the bread.
By contrast, a hyperedge from graph theory, by spanning multiple
source nodes and one destination node, can model the above
association rule (as illustrated in Figure 2). The probability
associated with the hyperedge represents the likelihood of the
purchase action denoted by the destination node when all purchase
actions denoted by source nodes have happened.
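
As a minimal sketch of the confidence computation described above, the following Python snippet counts, among the transactions that contain every antecedent item, the share that also contains the consequent. The function name `rule_confidence` and the toy transactions are assumptions made for illustration; they do not come from the paper's data.

```python
from typing import Iterable, List, Set


def rule_confidence(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    antecedent_set = frozenset(antecedent)
    consequent_set = frozenset(consequent)
    # Only transactions with ALL antecedent items count (the hyperedge condition).
    matching = [t for t in transactions if antecedent_set <= t]
    if not matching:
        return 0.0
    hits = sum(1 for t in matching if consequent_set <= t)
    return hits / len(matching)


# Hypothetical supermarket transactions, invented for this example.
toy_transactions = [
    {"hotdog", "bread", "pickle"},
    {"hotdog", "bread"},
    {"hotdog", "bread", "pickle", "soda"},
    {"bread", "pickle"},
    {"hotdog"},
]

confidence = rule_confidence(toy_transactions, {"hotdog", "bread"}, {"pickle"})
print(confidence)
```

Three of the toy transactions contain both hotdog and bread, and two of those also contain pickle, so the value plays the role of the weight on a single hyperedge with two sources rather than on two separate simple edges.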

On the other hand, in viral marketing, the traditional IC model
activates a new node by the social influence probabilities
associated with edges to the node. Aiming to capture both effects
of item inference and social influence, we propose a new Social
Item Graph (SIG). SIG models the likelihood for a purchase (or a
set of purchases) to trigger another purchase in the form of
hyperedges, which may have one or multiple source nodes leading to
one destination node. We define a social item graph as follows.

Definition 4. Given a social graph of customers $G = (V, E)$ and a
commodity item list I, a social item graph is denoted by
$G_{SI} = (V_{SI}, E_H)$, where $V_{SI}$ is a set of purchase
actions and $E_H$ is a set of hyperedges over $V_{SI}$. A node
$n \in V_{SI}$ is denoted as $(v, i)$, where $v \in V$ and
$i \in I$. A hyperedge $e \in E_H$ is of the following form:

$\{(u_1, i_1), (u_2, i_2), \ldots, (u_m, i_m)\} \to (v, i)$

where each $u_j$ is in the neighborhood of v in G, i.e.,
$u_j \in N_G(v) = \{u \mid d(u, v) \le 1\}$.⁶

Note that the conventional social influence edge in a social graph
with one source and one destination can still be modeled in an SIG
as a simple edge associated with a corresponding influence
probability. Nevertheless, the influence probability from one
person to another can vary for different items (e.g., a person’s
influence on another person may differ between cosmetics and
smartphones). Moreover, although an SIG may model the purchases
more accurately with the help of both social influence and item
inference, the complexity of processing an SIG with hyperedges is
much higher than that of processing the simple edges in a
traditional social graph that denotes only social influence.⁷

For simplicity, let u and v (i.e., the symbols in typewriter
style) represent the nodes (u, i) and (v, i) in SIG for the rest of
this paper. We also denote a hyperedge as $e = U \to v$, where U is
a set of source nodes and v is the destination node. Let the
associated edge weight be $p_e$, which represents the probability
for v to be activated if all source nodes in U are activated. Note
that the activation probability is for one single hyperedge
$U \to v$. Other hyperedges sharing the same destination may have
different activation probabilities. For example, a subset of the
source nodes of a hyperedge $\{a, b, c, d\} \to x$ can still
activate x, e.g., via $\{a, b, c\} \to x$, which is a different
hyperedge with its own activation probability.

6 Notice that when $u_1 = u_2 = \cdots = u_m = v$, the hyperedge
represents the item inference of item i. On the other hand,
when $i_1 = i_2 = \cdots = i_m = i$, it becomes the social influence
of user u on v.

7 To solve this issue, one approach is to transform an SIG
with hyperedges to a graph without hyperedges, by replacing
a hyperedge with multiple simple edges connecting
the sources and destinations, or by aggregating the source
nodes and destination nodes into two nodes, respectively.
Nevertheless, as will be shown in Section 4, the above
strategies do not work. Also, a destination node can be activated
only if all source nodes of the hyperedge are activated (see
Section 3.2).
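
One possible in-memory representation of an SIG hyperedge is sketched below, assuming nodes are `(customer, item)` tuples. The class name `Hyperedge`, the `kind` helper, and the toy customers and items are hypothetical; the classification simply follows the note that identical source customers correspond to item inference and identical items correspond to social influence.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Node = Tuple[str, str]  # (customer, item), i.e., one purchase action


@dataclass(frozen=True)
class Hyperedge:
    """A hyperedge U -> v with weight p_e (illustrative structure, not the paper's code)."""

    sources: FrozenSet[Node]  # U: all of these must be active
    dest: Node                # v: the purchase action that may be triggered
    prob: float               # p_e: activation probability once U is fully active

    def kind(self) -> str:
        customers = {customer for customer, _ in self.sources}
        items = {item for _, item in self.sources}
        if customers == {self.dest[0]}:
            return "item inference"    # same customer, other items
        if items == {self.dest[1]}:
            return "social influence"  # same item, other customers
        return "mixed"


store_rule = Hyperedge(frozenset({("alice", "dvd")}), ("alice", "novel"), 0.3)
friend_push = Hyperedge(frozenset({("alice", "dvd")}), ("bob", "dvd"), 0.4)
joint = Hyperedge(
    frozenset({("alice", "dvd"), ("bob", "novel")}), ("bob", "dvd"), 0.6
)
print(store_rule.kind(), friend_push.kind(), joint.kind())
```

Keeping `sources` as a frozen set makes the "all sources active" test a plain subset check against the set of active nodes.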

3.2 Diffusion Process in Social Item Graph 

Next we introduce the diffusion process in SIG, which is inspired
by the probability-based approach behind Independent Cascade (IC)
to capture the word-of-mouth behavior in the real world.⁸ This
diffusion process fits the item inferences captured in an SIG
naturally, as we can derive conditional probabilities on hyperedges
to describe the trigger (activation) of purchase actions on a
potential purchase.

The diffusion process in SIG starts with all nodes inactive
initially. Let S denote a set of seeds (purchase actions). Each
seed $s \in S$ immediately becomes active. Given a source set U, if
all of its nodes are active at iteration $\iota - 1$, the hyperedge
$e = U \to v$ has a chance to activate the inactive v at iteration
$\iota$ with probability $p_e$. Each node (v, i) can be activated
only once, but it can try to activate other nodes multiple times,
once for each incident hyperedge. For the seed selection problem
that we target, the total number of activated nodes represents the
number of items adopted by customers (called total adoptions for
the rest of this paper).
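
The diffusion process above can be sketched as a small Monte Carlo simulation. This is an illustrative reading under stated assumptions, not the authors' implementation: hyperedges are `(sources, destination, p_e)` triples, each hyperedge gets a single activation attempt in the round after its source set becomes fully active, and the names `simulate_once` and `estimate_total_adoption` are hypothetical.

```python
import random
from typing import FrozenSet, Hashable, Iterable, List, Set, Tuple

Node = Hashable
Hyperedge = Tuple[FrozenSet[Node], Node, float]  # (sources U, destination v, p_e)


def simulate_once(
    hyperedges: List[Hyperedge], seeds: Iterable[Node], rng: random.Random
) -> Set[Node]:
    """Run one diffusion and return the set of active nodes at the end."""
    active: Set[Node] = set(seeds)  # seeds are active immediately
    tried: Set[int] = set()         # each hyperedge gets one attempt
    while True:
        newly_active: Set[Node] = set()
        for idx, (sources, dest, prob) in enumerate(hyperedges):
            # A hyperedge fires only when ALL of its sources are active.
            if idx in tried or not sources <= active:
                continue
            tried.add(idx)
            if dest not in active and rng.random() < prob:
                newly_active.add(dest)
        if not newly_active:
            return active
        active |= newly_active


def estimate_total_adoption(
    hyperedges: List[Hyperedge], seeds: Iterable[Node], runs: int = 1000, seed: int = 0
) -> float:
    """Average number of active nodes over `runs` simulated diffusions."""
    rng = random.Random(seed)
    seed_list = list(seeds)
    total = sum(len(simulate_once(hyperedges, seed_list, rng)) for _ in range(runs))
    return total / runs


# Toy SIG with deterministic weights so the outcome is easy to follow.
toy_edges: List[Hyperedge] = [
    (frozenset({"a", "b"}), "c", 1.0),  # needs both a and b
    (frozenset({"c"}), "d", 1.0),
    (frozenset({"a"}), "e", 0.0),       # never succeeds
]
both = estimate_total_adoption(toy_edges, {"a", "b"}, runs=50)
only_a = estimate_total_adoption(toy_edges, {"a"}, runs=50)
print(both, only_a)
```

With seeds a and b the hyperedge into c fires and c in turn activates d, while seeding a alone activates nothing further, which is the all-sources requirement in action.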

4. SOCIAL ITEM MAXIMIZATION 

Upon the proposed Social Item Graph (SIG), we now formulate a new
seed selection problem, called Social Item Maximization Problem
(SIMP), that selects a set of seed purchase actions to maximize
potential sales or revenue in a marketing campaign. In Section 5 we
will describe how to construct the SIG from purchase logs by a
machine learning approach.

Definition 5. Given a seed number k, a list of targeted items I,
and a social item graph $G_{SI}(V_{SI}, E_H)$, SIMP selects a set S
of k seeds in $V_{SI}$ such that $\alpha_{G_{SI}}(S)$, the total
adoption function of S, is maximized.

Note that a seed in SIG represents the adoption/purchase action of
a specific item by a particular customer. The total adoption
function $\alpha_{G_{SI}}$ represents the total number of product
items ($\in I$) purchased. By assigning prices to products and
costs to the selected seeds, an extension of SIMP is to maximize
the total revenue minus the cost.

Here we first discuss the challenges in solving SIMP before
introducing our algorithm. Note that, for the influence
maximization problem based on the IC model, Kempe et al. propose a
(1 - 1/e)-approximation algorithm [16], thanks to the submodularity
in the problem. Unfortunately, the submodularity does not hold for
the total adoption function $\alpha_{G_{SI}}(S)$ in SIMP.
Specifically, if the function $\alpha_{G_{SI}}(S)$ satisfied
submodularity, then for any node i and any two subsets of nodes
$S_1$ and $S_2$ where $S_1 \subseteq S_2$,
$\alpha_{G_{SI}}(S_1 \cup \{i\}) - \alpha_{G_{SI}}(S_1) \ge \alpha_{G_{SI}}(S_2 \cup \{i\}) - \alpha_{G_{SI}}(S_2)$
should hold. However, a counterexample is illustrated below.

8 Notice that the diffusion process in SIG is based on the IC model
since it only requires one diffusion probability parameter
associated with each edge, whereas the LT model requires both the
influence degree of each edge and an influence threshold for
each node. Moreover, several variants of the IC model have
been proposed. However, they focus on modeling the
diffusion process between users, such as aspect awareness [5],
which is not suitable for the social item graph since the topic is
embedded in each SIG node.

Figure 5: An illustration of graph transformations

Example 1. Consider an SIMP instance with a customer and five items
in Figure 3. Consider the case where $S_1 = \{u_4\}$,
$S_2 = \{u_1, u_4\}$, and i corresponds to node $u_2$. For seed
sets $\{u_4\}$, $\{u_2, u_4\}$, $\{u_1, u_4\}$ and
$\{u_1, u_2, u_4\}$, we have $\alpha_{G_{SI}}(\{u_4\}) = 1.9$,
$\alpha_{G_{SI}}(\{u_2, u_4\}) = 2.9$,
$\alpha_{G_{SI}}(\{u_1, u_4\}) = 2.9$, and
$\alpha_{G_{SI}}(\{u_1, u_2, u_4\}) = 4.4$. Thus,
$\alpha_{G_{SI}}(S_1 \cup \{u_2\}) - \alpha_{G_{SI}}(S_1) = 1 < 1.5 = \alpha_{G_{SI}}(S_2 \cup \{u_2\}) - \alpha_{G_{SI}}(S_2)$.
Hence, the submodularity does not hold.
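
The arithmetic of Example 1 can be checked mechanically. The sketch below stores the four total adoption values quoted in the example in a dictionary and compares the two marginal gains; the helper `marginal_gain` is a hypothetical name, and the underlying graph is not reproduced here, only the stated values.

```python
from typing import Dict, FrozenSet, Iterable

# Total adoption values quoted in Example 1, keyed by seed set.
alpha: Dict[FrozenSet[str], float] = {
    frozenset({"u4"}): 1.9,
    frozenset({"u2", "u4"}): 2.9,
    frozenset({"u1", "u4"}): 2.9,
    frozenset({"u1", "u2", "u4"}): 4.4,
}


def marginal_gain(values: Dict[FrozenSet[str], float], base: Iterable[str], node: str) -> float:
    """Gain in total adoption from adding `node` to the seed set `base`."""
    base_set = frozenset(base)
    return values[base_set | {node}] - values[base_set]


gain_small = marginal_gain(alpha, {"u4"}, "u2")        # S1 = {u4}
gain_large = marginal_gain(alpha, {"u1", "u4"}, "u2")  # S2 = {u1, u4}, a superset of S1

# Submodularity would require gain_small >= gain_large.
violates_submodularity = gain_small < gain_large
print(round(gain_small, 6), round(gain_large, 6), violates_submodularity)
```

Adding $u_2$ to the larger set helps more than adding it to the smaller one, which is the opposite of the diminishing-returns behavior that the greedy guarantee relies on.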

Since the submodularity does not hold in SIMP, the (1 - 1/e)
approximation ratio of the greedy algorithm in [16] does not hold
either. Now, an interesting question is how large the ratio
becomes. Example 2 shows an SIMP instance where the greedy
algorithm performs poorly.

Example 2. Consider an example in Figure 4 where nodes
$v_1, v_2, \ldots, v_M$ all have a hyperedge with probability 1
from the same k sources $u_1, u_2, \ldots, u_k$, and $\epsilon$ is
an arbitrarily small edge probability with $\epsilon > 0$. The
greedy algorithm selects one node in each iteration, i.e., it
selects $u'_1, u'_2, \ldots, u'_k$ as the seeds with a total
adoption of $k + k\epsilon$. However, the optimal solution actually
selects $u_1, u_2, \ldots, u_k$ as the seeds and results in a total
adoption of $M + k$. Therefore, the approximation ratio of the
greedy algorithm is at least $(M + k)/(k + k\epsilon)$, which is
close to $M/k$ for a large M, where M could approach $|V_{SI}|$ in
the worst case.

One may argue that the above challenges in SIMP may be alleviated
by transforming $G_{SI}$ into a graph with only simple edges, as
displayed in Figure 5, where the weight of every $u_i \to v$ with
$u_i \in U$ can be set independently. However, if a source node
$u_m \in U$ of v is difficult to activate, the probability for v to
be activated approaches zero in Figure 5(a) due to $u_m$. In
contrast, in Figure 5(b), the destination v is inclined to be
activated by sources in U, especially when U is sufficiently large.
Thus, the idea of graph transformation does not work.
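
A small numeric sketch can make the mismatch concrete. It assumes, purely for illustration, that the sources become active independently with known probabilities; that independence assumption, the function names, and the numbers are not taken from the paper.

```python
from math import prod
from typing import Sequence


def hyperedge_activation(source_probs: Sequence[float], edge_prob: float) -> float:
    """P(v active) when ONE hyperedge needs every source to be active."""
    return edge_prob * prod(source_probs)


def simple_edges_activation(
    source_probs: Sequence[float], edge_probs: Sequence[float]
) -> float:
    """P(v active) when each source has its own simple edge to v."""
    return 1.0 - prod(1.0 - s * p for s, p in zip(source_probs, edge_probs))


# Nine easy sources and one source that is very hard to activate.
source_probs = [0.9] * 9 + [0.01]

with_hyperedge = hyperedge_activation(source_probs, edge_prob=1.0)
with_simple_edges = simple_edges_activation(source_probs, [0.5] * len(source_probs))
print(round(with_hyperedge, 4), round(with_simple_edges, 4))
```

Under these assumed numbers the single hard source drives the hyperedge probability close to zero, while the simple-edge version activates v almost surely, so no independent choice of simple-edge weights reproduces the hyperedge behavior across such cases.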

4.1 Hyperedge-Aware Greedy (HAG) 

Here, we propose an algorithm for SIMP, Hyperedge-Aware Greedy
(HAG), with performance guarantee. The approximation ratio is
proved later in Section 4. A hyperedge requires all its sources to
be activated first in order to activate the destination.
Conventional single-node greedy algorithms perform poorly because
hyperedges are not considered. To address this important issue, HAG
selects multiple seeds in each iteration.

A naive algorithm for SIMP would examine all $\binom{|V_{SI}|}{k}$
combinations to choose k seeds. In this paper, as multiple seeds
tend to activate all source nodes of a hyperedge in order to
activate its destination, an effective way is to consider only the
combinations which include the source nodes of some hyperedge. We
call the source nodes of a hyperedge a source combination. Based on
this idea, in each iteration, HAG includes the source combination
leading to the largest increment in total adoption divided by the
number of new seeds added in this iteration. Note that only the
source combinations with no more than k sources are considered. The
iteration continues until k seeds are selected. Note that HAG does
not restrict the seeds to be the source nodes of hyperedges.
Instead, the source node u of any simple edge $u \to v$ in SIG is
also examined.
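
The selection loop described above can be sketched as follows. This is a simplified reading rather than the authors' code: the total adoption is supplied as a callable `adoption` (in practice it would be estimated by simulation), candidates are limited to the remaining seed budget in each round, and the names `hag` and `deterministic_adoption` as well as the toy instance are assumptions for illustration.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, List, Optional, Set, Tuple

Node = Hashable
Edge = Tuple[FrozenSet[Node], Node]  # (source combination, destination)


def hag(edges: List[Edge], k: int, adoption: Callable[[Set[Node]], float]) -> Set[Node]:
    """Pick up to k seeds, one source combination per iteration."""
    seeds: Set[Node] = set()
    combinations = {sources for sources, _ in edges}  # includes simple edges
    while len(seeds) < k:
        budget = k - len(seeds)
        base = adoption(seeds)
        best: Optional[Set[Node]] = None
        best_ratio = float("-inf")
        for combo in combinations:
            new_seeds = set(combo) - seeds
            if not new_seeds or len(new_seeds) > budget:
                continue
            # Increment in total adoption per newly added seed.
            ratio = (adoption(seeds | new_seeds) - base) / len(new_seeds)
            if ratio > best_ratio:
                best, best_ratio = new_seeds, ratio
        if best is None:  # no combination fits the remaining budget
            break
        seeds |= best
    return seeds


def deterministic_adoption(edges: List[Edge]) -> Callable[[Iterable[Node]], float]:
    """Toy oracle: every hyperedge has probability 1, so adoption is a closure size."""
    def total(seed_set: Iterable[Node]) -> float:
        active = set(seed_set)
        changed = True
        while changed:
            changed = False
            for sources, dest in edges:
                if dest not in active and sources <= active:
                    active.add(dest)
                    changed = True
        return float(len(active))
    return total


# Two sources jointly activate five nodes; two decoy simple edges activate one node each.
toy: List[Edge] = [(frozenset({"u1", "u2"}), f"v{j}") for j in range(1, 6)]
toy += [(frozenset({"w1"}), "x1"), (frozenset({"w2"}), "x2")]

oracle = deterministic_adoption(toy)
chosen = hag(toy, k=2, adoption=oracle)
print(sorted(chosen), oracle(chosen))
```

On this toy instance a single-node greedy would prefer the decoys, since either of `u1` or `u2` alone adds nothing beyond itself, whereas evaluating the pair as one source combination yields the larger per-seed increment.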

Complexity of HAG. To select k seeds, HAG takes at most k rounds.
In each round, the source combinations of $|E_H|$ hyperedges are
tried one by one, and the diffusion cost is $c_{dif}$, which will
be analyzed in Section 4.2. Thus, the time complexity of HAG is
$O(k \times |E_H| \times c_{dif})$.

4.2 Acceleration of Diffusion Computation 

To estimate the total adoption for a seed set, it is necessary to
perform Monte Carlo simulation based on the diffusion process
described in Section 3.2 many times. Finding the total adoption is
very expensive, especially when a node v can be activated by a
hyperedge with a large source set U, which indicates that there
also exist many other hyperedges with an arbitrary subset of U as
the source set to activate v. In other words, an enormous number of
hyperedges need to be examined for the diffusion on an SIG. It is
essential to reduce the computational overhead. To address this
issue, we propose a new index structure, called SIG-index, by
exploiting the FP-tree to pre-process source combinations in
hyperedges in a compact form in order to facilitate efficient
derivation of activation probabilities during the diffusion
process.

The basic idea behind SIG-index is as follows. For each node v with
the set of activated in-neighbors in iteration $\iota$, if v has
not been activated before $\iota$, the diffusion process will try
to activate v via every hyperedge $U \to v$ where the last source
in U has been activated in iteration $\iota - 1$. To derive the
activation probability of a node v from the weights of hyperedges
associated with v, we first define the activation probability as
follows.

Definition 6. The activation probability of v at $\iota$ is

$$ap_{v,\iota} = 1 - \prod_{U \to v \in E_H,\; U \subseteq N_{v,\iota-1},\; U \not\subseteq N_{v,\iota-2}} (1 - p_{U \to v}),$$

where $N_{v,\iota-1}$ and $N_{v,\iota-2}$ denote the sets of active
neighbors of v in iterations $\iota - 1$ and $\iota - 2$,
respectively.
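
Definition 6 can be evaluated directly by scanning the hyperedges into v, which is the baseline that the SIG-index is meant to speed up. The sketch below is an illustrative reading under the assumption that a hyperedge contributes exactly when its source set is fully active after the previous iteration but was not fully active one iteration earlier; the function name is hypothetical, and the weights are those listed in Example 3 for destination $v_5$.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

SourceSet = FrozenSet[str]


def activation_probability(
    hyperedges_into_v: Iterable[Tuple[SourceSet, float]],
    active_prev: Set[str],   # active neighbors of v in iteration (iota - 1)
    active_prev2: Set[str],  # active neighbors of v in iteration (iota - 2)
) -> float:
    """1 - product of (1 - p) over hyperedges whose last source just became active."""
    miss = 1.0
    for sources, prob in hyperedges_into_v:
        if sources <= active_prev and not sources <= active_prev2:
            miss *= 1.0 - prob
    return 1.0 - miss


# Hyperedges into v5 with the weights listed in Example 3.
into_v5: List[Tuple[SourceSet, float]] = [
    (frozenset({"v1"}), 0.5),
    (frozenset({"v1", "v2"}), 0.4),
    (frozenset({"v1", "v2", "v3"}), 0.2),
    (frozenset({"v1", "v2", "v3", "v4"}), 0.1),
    (frozenset({"v1", "v3"}), 0.3),
    (frozenset({"v1", "v3", "v4"}), 0.2),
    (frozenset({"v2"}), 0.2),
    (frozenset({"v2", "v3", "v4"}), 0.1),
    (frozenset({"v2", "v4"}), 0.1),
]

after_v2 = activation_probability(into_v5, {"v2"}, set())
after_v1 = activation_probability(into_v5, {"v1", "v2"}, {"v2"})
print(round(after_v2, 6), round(after_v1, 6))
```

Once only $v_2$ is active, the single eligible hyperedge is $\{v_2\} \to v_5$; if $v_1$ becomes active next, both $\{v_1\} \to v_5$ and $\{v_1, v_2\} \to v_5$ become eligible together.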

The operations on an SIG-index occur in two phases: Index Creation
Phase and Diffusion Processing Phase. As all hyperedges satisfying
Definition 6 must be accessed, the SIG-index stores the hyperedge
probabilities in Index Creation Phase. Later, the SIG-index is
updated in Diffusion Processing Phase to derive the activation
probability efficiently.



Figure 6: An illustration of SIG-index 

Index Creation Phase. For each hyperedge $U \to v$, we first regard
each source combination $U = \{v_1, \ldots, v_{|U|}\}$ as a
transaction to build an FP-tree by setting the minimum support as
1. As such, $v_1, \ldots, v_{|U|}$ forms a path
$r \to v_1 \to v_2 \ldots \to v_{|U|}$ from the root r in the
FP-tree to node $v_{|U|}$ in U. Different from the FP-tree, the
SIG-index associates the probability of each hyperedge $U \to v$
with the last source node $v_{|U|}$ in U.⁹ Initially the
probability associated with the root r is 0. Later the SIG-index is
updated during the diffusion process. Example 3 illustrates the
SIG-index created based on an SIG.

Example 3. Consider an SIG graph with five nodes, $v_1$-$v_5$, and
nine hyperedges with their associated probabilities in
parentheses:¹⁰ $\{v_1\} \to v_5$ (0.5), $\{v_1, v_2\} \to v_5$
(0.4), $\{v_1, v_2, v_3\} \to v_5$ (0.2),
$\{v_1, v_2, v_3, v_4\} \to v_5$ (0.1), $\{v_1, v_3\} \to v_5$
(0.3), $\{v_1, v_3, v_4\} \to v_5$ (0.2), $\{v_2\} \to v_5$ (0.2),
$\{v_2, v_3, v_4\} \to v_5$ (0.1), $\{v_2, v_4\} \to v_5$ (0.1).
Figure 6(a) shows the SIG-index initially created for node $v_5$.

Diffusion Processing Phase. The activation probability in an
iteration is derived by traversing the initial SIG-index, which
takes $O(|E_H|)$ time. However, a simulation may iterate many
times. To further accelerate the traversal, we adjust the SIG-index
for the activated nodes in each iteration. More specifically, after
a node $v_a$ is activated, accessing a hyperedge $U \to v$ with
$v_a \in U$ becomes easier since the number of remaining inactive
nodes in $U - \{v_a\}$ is reduced. Accordingly, the SIG-index is
modified by traversing every vertex labeled as $v_a$ on the
SIG-index in the following steps. 1) If $v_a$ is associated with a
probability $p_a$, it is crucial to aggregate the old activation
probabilities $p_a$ of $v_a$ and $p_p$ of its parent $v_p$, and
update the activation probability associated with $v_p$ to
$1 - (1 - p_a)(1 - p_p)$, since the source combinations needed for
accessing the hyperedges associated with $v_a$ and $v_p$ become the
same. The aggregation is also performed when $v_p$ is r. 2) If
$v_a$ has any children c, the parent of c is changed to $v_p$,
which removes the processed $v_a$ from the index. 3) After
processing every vertex labeled $v_a$ in the SIG-index, we obtain
the activation probability of v in the root r. After the
probability is accessed for activating v, the probability of r is
reset to 0 for the next iteration.

Example 4. Consider an example with $v_2$ activated in an
iteration. To update the SIG-index, each vertex labeled $v_2$ in
Figure 6(a) is examined by traversing the linked list of $v_2$.
First, the left vertex with label $v_2$ is examined. The SIG-index
reassigns the parent of $v_2$'s child (labeled as $v_3$) to the
vertex labeled as $v_1$, and aggregates the probability 0.4 on
$v_2$ and 0.5 on vertex $v_1$, since the hyperedge
$\{v_1, v_2\} \to v_5$ can be accessed if the node $v_1$ is
activated later. The probability of $v_1$ becomes
$1 - (1 - p_{v_1})(1 - p_{v_2}) = 0.7$. Then the right vertex with
label $v_2$ is examined. The parent of its two children is
reassigned to the root r. Also, its own probability (0.2) is
aggregated with the root r, indicating that the activation
probability of node $v_5$ in the next iteration is 0.2.

9 For ease of explanation, we assume the order of nodes in
the SIG-index follows the ascending order of subscript.

10 For simplicity, the hyperedges in this example only have $v_5$
as the destination.

Figure 7: An illustration instance built for 3-SAT
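
The two phases can be sketched together as a small prefix tree with per-label node-links. This is an illustrative reconstruction under stated assumptions, not the authors' data structure: the class names `Vertex` and `SigIndex` are hypothetical, sources are ordered by sorting their labels, a probability of 0.0 means that no hyperedge ends at a vertex, and children are kept in a list because re-parenting can leave two siblings with the same label.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class Vertex:
    def __init__(self, label: Optional[str], parent: Optional["Vertex"]) -> None:
        self.label = label
        self.parent = parent
        self.children: List["Vertex"] = []
        self.prob = 0.0  # weight of the hyperedge ending at this vertex, if any


class SigIndex:
    """Index over the source sets of all hyperedges into one destination node."""

    def __init__(self, hyperedges: Iterable[Tuple[Iterable[str], float]]) -> None:
        self.root = Vertex(None, None)
        self.links: Dict[str, List[Vertex]] = {}  # node-links: label -> vertices
        # Index Creation Phase: insert each source set as a root-to-vertex path.
        for sources, prob in hyperedges:
            vertex = self.root
            for label in sorted(sources):
                child = next((c for c in vertex.children if c.label == label), None)
                if child is None:
                    child = Vertex(label, vertex)
                    vertex.children.append(child)
                    self.links.setdefault(label, []).append(child)
                vertex = child
            vertex.prob = prob  # stored on the last source of the hyperedge

    def activate(self, label: str) -> None:
        """Diffusion Processing Phase: splice out every vertex with this label."""
        for vertex in self.links.pop(label, []):
            parent = vertex.parent
            # Step 1: aggregate the vertex's probability into its parent.
            parent.prob = 1.0 - (1.0 - vertex.prob) * (1.0 - parent.prob)
            # Step 2: hand the children over to the parent.
            parent.children.remove(vertex)
            for child in vertex.children:
                child.parent = parent
                parent.children.append(child)

    def pop_activation_probability(self) -> float:
        """Step 3: read the probability at the root, then reset it to 0."""
        prob, self.root.prob = self.root.prob, 0.0
        return prob


# Hyperedges into v5 with the weights of Example 3.
index = SigIndex([
    (["v1"], 0.5), (["v1", "v2"], 0.4), (["v1", "v2", "v3"], 0.2),
    (["v1", "v2", "v3", "v4"], 0.1), (["v1", "v3"], 0.3),
    (["v1", "v3", "v4"], 0.2), (["v2"], 0.2), (["v2", "v3", "v4"], 0.1),
    (["v2", "v4"], 0.1),
])
index.activate("v2")
prob_on_v1 = index.links["v1"][0].prob
ap_after_v2 = index.pop_activation_probability()
print(round(prob_on_v1, 6), round(ap_after_v2, 6))
```

Activating $v_2$ on this index reproduces the two aggregations walked through in Example 4: the vertex labeled $v_1$ absorbs the weight of $\{v_1, v_2\} \to v_5$, and the root receives the weight of $\{v_2\} \to v_5$.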

Complexity Analysis. For Index Creation Phase, the initial
SIG-index for v is built by examining the hyperedges two times with
$O(|E_H|)$ time. The number of vertices in SIG-index is at most
$O(c_d |E_H|)$, where $c_d$ is the number of source nodes in the
largest hyperedge. During Diffusion Processing Phase, each vertex
in SIG-index is examined only once through the node-links, and the
parent of a vertex is changed at most $O(c_d)$ times. Thus,
completing a diffusion requires at most $O(c_d |E_H|)$ time overall.

4.3 Hardness Results 

From the discussion earlier, it becomes obvious that SIMP is
difficult. In the following, we prove that SIMP is inapproximable
within a non-constant ratio $n^c$ for all $c < 1$, with a
gap-introducing reduction from an NP-complete problem, 3-SAT, to
SIMP, where n is the number of nodes in an SIG.

Given an expression $\phi$ in CNF, in which each clause has three
variables, 3-SAT is to decide whether $\phi$ is satisfiable. The
reduction includes two parts: 1) If $\phi$ is satisfiable, its
transformed SIMP instance has optimal total adoption larger than
$\alpha_{SAT}$. 2) If $\phi$ is unsatisfiable, its transformed SIMP
instance has optimal total adoption less than $\alpha_{UNSAT}$.
(Refer to Lemma 1 for the actual $\alpha_{SAT}$ and
$\alpha_{UNSAT}$.) Then the inapproximability obtained by this
gap-introducing reduction is $\alpha_{SAT}/\alpha_{UNSAT}$. Note
that an $\alpha_{SAT}/\alpha_{UNSAT}$-approximation algorithm is
able to solve 3-SAT because it always returns a solution larger
than
$(\alpha_{UNSAT}/\alpha_{SAT}) \times \alpha_{SAT} = \alpha_{UNSAT}$
for satisfiable $\phi$ and a solution smaller than $\alpha_{UNSAT}$
for unsatisfiable $\phi$, implying that such an approximation
algorithm must not exist. Also note that the theoretical result
only shows that for any algorithm, there exists a problem instance
of SIMP (i.e., a pair of an SIG graph and a seed number k) for
which the algorithm cannot obtain a solution better than 1/n times
the optimal solution. It does not imply that an algorithm always
performs badly in every SIMP instance.

Lemma 1. For a positive integer $q$, there is a gap-introducing reduction from 3-SAT to SIMP, which transforms an $n_{var}$-variable expression $\phi$ to an SIMP instance with the SIG as $G_{SI}(V_{SI}, E_H)$ and $k$ as $n_{var}$ such that

• if $\phi$ is satisfiable, $\alpha^*_{G_{SI}} > (m_{cla} + 3n_{var})^q$, and

• if $\phi$ is not satisfiable, $\alpha^*_{G_{SI}} < m_{cla} + 3n_{var}$,

where $\alpha^*_{G_{SI}}$ is the optimal solution of this instance, $n_{var}$ is the number of Boolean variables, and $m_{cla}$ is the number of clauses. Hence there is no $(m_{cla} + 3n_{var})^{q-1}$ approximation algorithm for SIMP unless P = NP.

Proof. Given a positive integer $q$, for an instance $\phi$ of 3-SAT with $n_{var}$ Boolean variables $a_1, \ldots, a_{n_{var}}$ and $m_{cla}$ clauses $C_1, \ldots, C_{m_{cla}}$, we construct an SIG $G_{SI}$ with three node sets $X$, $Y$ and $Z$ as follows. 1) Each Boolean variable $a_i$ corresponds to two nodes $x_i$, $\bar{x}_i$ in $X$ and one node $y_i$ in $Y$. 2) Each clause $C_k$ corresponds to one node $c_k$ in $Y$. 3) $Z$ has $(|X| + |Y|)^q$ nodes. (Thus, $G_{SI}$ has $(m_{cla} + 3n_{var})^q + m_{cla} + 3n_{var}$ nodes.) 4) For each $y_i$ in $Y$, we add directed edges $x_i \to y_i$ and $\bar{x}_i \to y_i$. 5) For each $c_k$ in $Y$, we add directed edges $\alpha \to c_k$, $\beta \to c_k$ and $\gamma \to c_k$, where $\alpha$, $\beta$, $\gamma$ are the nodes in $X$ corresponding to the three literals in $C_k$. 6) We add a hyperedge $Y \to z_v$ from all nodes in $Y$ to $z_v$ for every $z_v \in Z$. The probability of every edge is set to 1. An example is illustrated in Figure 7.

We first prove that $\phi$ is satisfiable if and only if $G_{SI}$ has a seed set $S$ with $n_{var}$ seeds such that the total adoption of $S$ contains $Y$. If $\phi$ is satisfiable, there exists a truth assignment $T$ on the Boolean variables $a_1, \ldots, a_{n_{var}}$ satisfying all clauses of $\phi$. Let $S = \{x_i \mid T(a_i) = 1\} \cup \{\bar{x}_i \mid T(a_i) = 0\}$; then $S$ has $n_{var}$ nodes and the total adoption of $S$ contains $Y$. On the other hand, if $\phi$ is not satisfiable, there exists no seed set $S$ with exactly one of $x_i$ or $\bar{x}_i$ selected for every $i$ such that the total adoption of $S$ contains $Y$. The other cases are as follows. 1) All seeds are placed in $X$, but there exists at least one $i$ with both $x_i$ and $\bar{x}_i$ selected. In this case, there must exist some $j$ such that neither $x_j$ nor $\bar{x}_j$ is selected (since the seed number is $n_{var}$), and thus $Y$ is not covered by the total adoption of $S$. 2) A seed is placed in $Y$. In this case, the seed can be moved to an adjacent node in $X$ without reducing the total adoption. Nevertheless, as explained above, there exists no seed set $S$ with all seeds placed in $X$ such that the total adoption of $S$ contains $Y$, and thus the total adoption of any seed set with a seed placed in $Y$ cannot cover $Y$ either. With the above observations, if $\phi$ is not satisfiable, $G_{SI}$ does not have a seed set $S$ with $n_{var}$ seeds such that the total adoption of $S$ contains $Y$. Since the nodes of $Z$ can be activated if and only if the total adoption of $S$ contains $Y$, which holds if and only if $\phi$ is satisfiable, we have

• if $\phi$ is satisfiable, $\alpha^*_{G_{SI}} > (m_{cla} + 3n_{var})^q$, and

• if $\phi$ is not satisfiable, $\alpha^*_{G_{SI}} < m_{cla} + 3n_{var}$.

The lemma follows. □
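The construction in the proof of Lemma 1 can be sketched in code. This is an illustrative sketch only: the encoding of literals as signed integers (`+i` for $a_i$, `-i` for its negation), the node tuples, and the function names are assumptions made for the example. Because every edge probability is 1, the diffusion is deterministic.

```python
def build_sig(n_var, clauses, q):
    """Build node sets X, Y, Z and the (hyper)edges of the reduction."""
    # ("x", i, True) plays the role of x_i, ("x", i, False) the role of its barred twin.
    x_nodes = [("x", i, sign) for i in range(1, n_var + 1) for sign in (True, False)]
    y_nodes = [("y", i) for i in range(1, n_var + 1)]
    y_nodes += [("c", k) for k in range(len(clauses))]
    z_nodes = [("z", j) for j in range((len(x_nodes) + len(y_nodes)) ** q)]

    edges = []  # each entry: (frozenset of source nodes, destination node)
    for i in range(1, n_var + 1):
        edges.append((frozenset([("x", i, True)]), ("y", i)))
        edges.append((frozenset([("x", i, False)]), ("y", i)))
    for k, clause in enumerate(clauses):
        for literal in clause:
            edges.append((frozenset([("x", abs(literal), literal > 0)]), ("c", k)))
    all_of_y = frozenset(y_nodes)
    for z in z_nodes:
        edges.append((all_of_y, z))  # hyperedge from all of Y to one node of Z
    return x_nodes, y_nodes, z_nodes, edges

def total_adoption(seeds, edges):
    """Deterministic diffusion: an edge fires once all its sources are active."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for sources, dest in edges:
            if dest not in active and sources <= active:
                active.add(dest)
                changed = True
    return len(active)

def seeds_from_assignment(assignment):
    """Seed x_i when a_i is true and its barred twin otherwise."""
    return {("x", i, value) for i, value in assignment.items()}

# (a1 or a2 or a3) and (not a1 or not a2 or a3), with q = 2.
clauses = [(1, 2, 3), (-1, -2, 3)]
x_nodes, y_nodes, z_nodes, edges = build_sig(3, clauses, 2)
satisfying = seeds_from_assignment({1: True, 2: False, 3: True})
falsifying = seeds_from_assignment({1: True, 2: True, 3: False})
print(total_adoption(satisfying, edges), total_adoption(falsifying, edges))
```

In this toy instance $m_{cla} + 3n_{var} = 11$, so seeds derived from a satisfying assignment reach all of $Z$ and exceed $11^2$ adoptions, while seeds derived from a falsifying assignment stay below 11.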

Theorem 1. For any $\epsilon > 0$, there is no $n^{1-\epsilon}$ approximation algorithm for SIMP, assuming P ≠ NP.

Proof. For any arbitrary $\epsilon > 0$, we set $q \geq 2/\epsilon$. Then, by Lemma 1, there is no $(m_{cla} + 3n_{var})^{q-1}$ approximation algorithm for SIMP unless P = NP. Then $(m_{cla} + 3n_{var})^{q-1} \geq 2(m_{cla} + 3n_{var})^{q-2} \geq 2(m_{cla} + 3n_{var})^{q(1-\epsilon)} \geq (2(m_{cla} + 3n_{var})^{q})^{1-\epsilon} \geq n^{1-\epsilon}$, where the second inequality uses $q\epsilon \geq 2$ and the last inequality uses $n = (m_{cla} + 3n_{var})^q + m_{cla} + 3n_{var} \leq 2(m_{cla} + 3n_{var})^q$. Since $\epsilon$ is arbitrarily small, for any $\epsilon > 0$ there is no $n^{1-\epsilon}$ approximation algorithm for SIMP, assuming P ≠ NP. The theorem follows. □
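The chain of inequalities in the proof can be sanity-checked numerically. The sketch below assumes $q$ is chosen as the smallest integer with $q \geq 2/\epsilon$ (one choice that makes the exponent comparison hold) and uses the node count $n = (m_{cla} + 3n_{var})^q + m_{cla} + 3n_{var}$ from the construction; the function name is hypothetical.

```python
import math

def gap_exceeds_bound(m_cla, n_var, eps):
    """Check (m_cla + 3 n_var)^(q-1) >= n^(1 - eps) for q = ceil(2 / eps)."""
    base = m_cla + 3 * n_var
    q = math.ceil(2 / eps)
    n = base ** q + base          # number of nodes in the constructed SIG
    return base ** (q - 1) >= n ** (1 - eps)

print(gap_exceeds_bound(2, 3, 0.5), gap_exceeds_bound(2, 3, 0.1))
```

Very small values of `eps` make `n` too large to convert to a float, so the check is only meant for moderate parameters.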

With Theorem 1, no algorithm can achieve an approximation ratio better than $n$. In Theorem 2, we prove that SIG-index is correct, and HAG with SIG-index achieves the best ratio, i.e., it is $n$-approximated to SIMP. Note that the approximation ratio only guarantees the lower bound of total adoption obtained by HAG theoretically. Later in Section 6.3, we empirically show that the total adoption obtained by HAG is comparable to the optimal solution.

Theorem 2. HAG with SIG-index is n-approximated, 
where n is the number of nodes in SIG. 

Proof. First, we prove that SIG-index obtains $ap_{v,\iota}$ correctly. Assume that there exists an incorrect $ap_{v,\iota}$, i.e., there exists a hyperedge $U \to v$ satisfying the conditions in Definition 6 (i.e., $U \not\subseteq V_{A,\iota-2}$ and $U \subseteq V_{A,\iota-1}$) but its probability is not aggregated to $r$ in iteration $\iota$. However, the probability cannot be aggregated before $\iota$ since $U \not\subseteq V_{A,\iota-2}$, and it must be aggregated no later than $\iota$ since $U \subseteq V_{A,\iota-1}$. There is a contradiction.

Proving that HAG with SIG-index is an $n$-approximation algorithm is simple. The upper bound of total adoption for the optimal algorithm is $n$, while the lower bound of the total adoption for HAG is 1 because at least one seed is selected. In other words, designing an approximation algorithm for SIMP is simple, but it is much more difficult to have the hardness result for SIMP, and we have proven that SIMP is inapproximable within $n^{1-\epsilon}$ for any arbitrarily small $\epsilon$. □


Corollary 1. HAG with SIG-index is n-approximated, 
where n is the number of nodes in SIG. 

Proving that HAG with SIG-index is an n-approximation 
algorithm is simple. The upper bound of total adoption for 
the optimal algorithm is n, while the lower bound of the 
total adoption for HAG is 1 because at least one seed is 
selected. Note that SIG-index is only for acceleration and 
does not change the solution quality. 

5. CONSTRUCTION OF SIG 

To select seeds for SIMP, we need to construct the SIG from purchase logs and the social network. We first create possible hyperedges by scanning the purchase logs. Let $\tau$ be the timestamp of a given purchase $v = (v, i)$. The purchases of $v$'s friends and her own purchases that have happened within a given period before $\tau$ are considered as candidate source nodes to generate hyperedges to $v$ (footnote 11). For each hyperedge $e$, the main task is then the estimation of its activation probability $p_e$. Since $p_e$ is unknown, it is estimated by maximizing the likelihood function based on observations in the purchase logs. Note that learning the activation probability $p_e$ for each hyperedge $e$ faces three challenges.

11 The considered periods of item inference and social influence can be different since social influence usually requires a longer time to propagate the messages while the item inference on the e-commerce websites can happen at the same time. The detailed setting of the time periods is discussed in Section 6.1.


C1. Unknown distribution of $p_e$. How to properly model $p_e$ is critical.

C2. Unobserved activations. When $v$ is activated at time $\tau$, this event only implies that at least one hyperedge successfully activates $v$ before $\tau$. It remains unknown which hyperedge(s) actually trigger $v$, i.e., the activation may be caused by item inference, social influence, or both. Therefore, we cannot simply employ the confidence of an association rule as the corresponding hyperedge probability.

C3. Data sparsity. The number of activations for a user to buy an item is small, whereas the number of possible hyperedge combinations is large. Moreover, new items emerge every day in e-commerce websites, which incurs the notorious cold-start problem. Hence, a method to deal with the data sparsity issue is necessary to properly model an SIG.

To address these challenges, we exploit a statistical inference approach to identify those hyperedges and learn their weights. In the following, we first propose a model of the edge function (to address the first challenge) and then exploit the smoothed expectation and maximization (EMS) algorithm [20] to address the second and third challenges.

5.1 Modeling of Hyperedge Probability 

To overcome the first challenge, one possible way is to model the numbers of successful and unsuccessful activations by binomial distributions. As such, $p_e$ is approximated by the ratio of the number of successful activations to the number of total activation trials. However, the binomial distribution function is too complex for computing the maximum likelihood over a vast amount of data. To handle big data, a previous study [?] reported that the binomial distribution $(n, p)$ can be approximated by the Poisson distribution with $\lambda = np$ when the time duration is sufficiently large. Accordingly, it is assumed that the number of activations of a hyperedge $e$ follows the Poisson distribution, which handles social influence and item inference jointly. The expected number of events equals the intensity parameter $\lambda$. Moreover, we use an inhomogeneous Poisson process defined on the space of hyperedges to ensure that $p_e$ varies with different $e$.
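The quality of the Poisson approximation to the binomial distribution is easy to inspect numerically. The sketch below is only an illustration with made-up parameters (many trials, small success probability); it is not taken from the paper's data.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return lam ** k * exp(-lam) / factorial(k)

n, p = 10_000, 0.0003      # illustrative values only
lam = n * p                # Poisson intensity lambda = n p
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(20))
print(f"lambda = {lam:.2f}, largest pointwise gap = {max_gap:.2e}")
```

The gap shrinks as `p` decreases with `n * p` held fixed, which is the regime in which the Poisson model replaces the binomial one.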

In the following, a hyperedge is of size $n$ if the cardinality of its source set $U$ is $n$. We denote the intensity of the number of activation trials of the hyperedge $e$ as $\lambda_T(e)$. Then the successful activations of hyperedge $e$ follow another Poisson process whose intensity is denoted by $\lambda_A(e)$. Therefore, the hyperedge probability $p_e$ can be derived from the parameters $\lambda_A(e)$ and $\lambda_T(e)$, i.e., $p_e = \frac{\lambda_A(e)}{\lambda_T(e)}$.

The maximum likelihood estimation can be employed to derive $\lambda_T(e)$. Nevertheless, $\lambda_A(e)$ cannot be derived in this way, as explained in the second challenge. Therefore, we use the expectation maximization (EM) algorithm, an extension of maximum likelihood estimation that handles latent variables, with $\lambda_A(e)$ modeled as the latent variable. Based on the observed purchase logs, the E-step first derives the likelihood Q-function of parameter $p_e$ with $\lambda_A(e)$ as the latent variable. In this step, the purchase logs and $p_e$ are given to find the probability function describing that all events on $e$ in the logs occur according to $p_e$, whereas the probability function (i.e., the Q-function) explores different possible values of the latent variable $\lambda_A(e)$. Afterward, the M-step maximizes the Q-function and derives the new $p_e$ for the E-step in the next iteration. These two steps are iterated until convergence.




With the employed Poisson distribution and EM algorithm, data sparsity remains an issue. Therefore, we further exploit a variant of the EM algorithm, called the EMS algorithm [20], to alleviate the sparsity problem by estimating the intensity of the Poisson process using similar hyperedges. The parameter smoothing after each iteration is called S-Step, which is incorporated in the EMS algorithm in addition to the existing E-Step and M-Step.

5.2 Model Learning by EMS Algorithm 

Let $p_e$ and $\hat{p}_e$ denote the true probability and the estimated probability for hyperedge $e$ in the EMS algorithm, respectively, where $e = U \to v$. Let $N_U$ and $K_e$ denote the number of activations of source set $U$ in the purchase logs and the number of successful activations on hyperedge $e$, respectively. The EM algorithm is exploited to find the maximum likelihood of $p_e$, while $\lambda_A(e)$ is the latent variable because $K_e$ cannot be observed (i.e., only $N_U$ can be observed). Therefore, E-Step derives the likelihood function for $\{p_e\}$ (i.e., the Q-function) as follows,

$$Q(p_e, \hat{p}_e^{(i-1)}) = E_{K_e}\left[\log P(K_e, N_U \mid p_e) \,\middle|\, N_U, \hat{p}_e^{(i-1)}\right], \tag{1}$$

where $\hat{p}_e^{(i-1)}$ is the hyperedge probability derived in the previous iteration. Note that $N_U$ and $\hat{p}_e^{(i-1)}$ are given parameters in iteration $i$, whereas $p_e$ is a variable in the Q-function, and $K_e$ is a random variable governed by the distribution $P(K_e \mid N_U, \hat{p}_e^{(i-1)})$. Since $p_e$ is correlated to $\lambda_T(U)$ and $\lambda_A(e)$, we derive the likelihood $P(K_e, N_U \mid p_e)$ as follows.


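$$\begin{aligned}
P(K_e, N_U \mid p_e) &= P\left(\{K_e\}_{e \in E_H}, \{N_U\}_{U \subseteq V_{SI}} \,\middle|\, \{p_e\}_{e \in E_H}, \{\lambda_T(U)\}_{U \subseteq V_{SI}}\right) \\
&= P\left(\{K_e\}_{e \in E_H} \,\middle|\, \{p_e\}_{e \in E_H}, \{N_U, \lambda_T(U)\}_{U \subseteq V_{SI}}\right) \times P\left(\{N_U\}_{U \subseteq V_{SI}} \,\middle|\, \{\lambda_T(U)\}_{U \subseteq V_{SI}}\right).
\end{aligned}$$

It is assumed that $\{K_e\}$ is independent of $\{N_U\}$, and $Q(p_e, \hat{p}_e^{(i-1)})$ can be derived as follows:

$$\sum_{e \in E_H} \log P(K_e \mid N_U, p_e) + \log P\left(\{N_U\}_{U \subseteq V_{SI}} \,\middle|\, \{\lambda_T(U)\}_{U \subseteq V_{SI}}\right).$$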

Since only the first term contains the hidden $K_e$, only this term varies in different iterations of the EMS algorithm; the second term, which involves only the observed $\{N_U\}_{U \subseteq V_{SI}}$, can always be maximized directly by maximum likelihood as follows. Let $p_{U,k}$ denote the probability that the source set $U$ tries to activate the destination node exactly $k$ times, i.e., $p_{U,k} = P\{N_U = k\}$. The log-likelihood of $\lambda_T$ is

$$\begin{aligned}
\sum_k p_{U,k} \ln\left(\frac{\lambda_T^k e^{-\lambda_T}}{k!}\right)
&= \sum_k p_{U,k}\left(-\lambda_T + k \ln \lambda_T - \ln k!\right) \\
&= -\lambda_T + (\ln \lambda_T) \sum_k k\, p_{U,k} - \sum_k p_{U,k} \ln k!.
\end{aligned}$$

We acquire the maximum likelihood by finding the derivative with regard to $\lambda_T$:

$$-1 + \frac{1}{\lambda_T} \sum_k k\, p_{U,k} = 0. \tag{2}$$

Thus, the maximum likelihood estimate is $\lambda_T = \sum_k k\, p_{U,k}$, representing that the expected number of activation trials (i.e., $\lambda_T(e)$) is $N_U$. Let $A = \{(v, \tau)\}$ denote the action log set, where each log $(v, \tau)$ represents that $v$ is activated at time $\tau$. $N_U$ is calculated by scanning $A$ and counting the times that all the nodes in $U$ are activated.

Afterward, we focus on the first term of $Q(p_e, \hat{p}_e^{(i-1)})$. Let $p_{e,k} = P\{K_e = k\}$ denote the probability that the hyperedge $e$ activates the destination node exactly $k$ times. In E-step, we first find the expectation for $K_e$ as follows. The first term is

$$\sum_{e \in E_H} \sum_{k=1,\ldots,N_U} p_{e,k} \log P(k \mid N_U, p_e),$$

where, for each hyperedge $e$,

$$\begin{aligned}
\sum_{k=1,\ldots,N_U} p_{e,k} \log P(k \mid N_U, p_e)
&= \sum_{k=1,\ldots,N_U} p_{e,k} \log\left(\binom{N_U}{k} p_e^{k} (1 - p_e)^{N_U - k}\right) \\
&= \sum_{k=1,\ldots,N_U} p_{e,k} \left(\log \binom{N_U}{k} + k \log p_e + (N_U - k) \log(1 - p_e)\right).
\end{aligned}$$

Since $\sum_{k=1,\ldots,N_U} k\, p_{e,k} = E[K_e]$ and $\sum_{k=1,\ldots,N_U} p_{e,k} = 1$, the log-likelihood of the first term is further simplified as

$$\sum_{k=1,\ldots,N_U} p_{e,k} \log \binom{N_U}{k} + N_U \log(1 - p_e) + E[K_e]\left(\log p_e - \log(1 - p_e)\right).$$

Afterward, M-step maximizes the Q-function by finding the derivative with regard to $p_e$:

$$-N_U \frac{1}{1 - p_e} + E[K_e]\left(\frac{1}{p_e} + \frac{1}{1 - p_e}\right) = 0 \quad\Longrightarrow\quad p_e = E[K_e]/N_U.$$

Therefore, the maximum likelihood estimator $\hat{p}_e$ is $E[K_e]/N_U$, $\hat{\lambda}_T(U)$ is $N_U$, and $\hat{\lambda}_A(e) = E[K_e]$.

The remaining problem is to take the expectation of the latent variables $\{K_e\}$ in E-step. Let $\{w_{e,a}\}_{e \in E_H,\, a = (v,\tau) \in A}$ be the conditional probability that $v$ is activated by the source set $U$ of $e$ at $\tau$ given that $v$ is activated at $\tau$, and let $E_a$ denote the set of candidate hyperedges containing every possible $e$ with its source set activated at time $\tau - 1$, i.e., $E_a = \{(u_1, u_2, \ldots, u_n) \to v \mid \forall i = 1, \ldots, n,\ u_i \in \mathcal{N}(v),\ (u_i, \tau - 1) \in A\}$. It is easy to show that, given the estimation of the probability of hyperedges,

$$w_{e,a} = \frac{\hat{p}_e}{1 - \prod_{e' \in E_a} (1 - \hat{p}_{e'})},$$

since $1 - \prod_{e' \in E_a} (1 - \hat{p}_{e'})$ is the probability for $v$ to be activated by any hyperedge at time $\tau$. The expectation of $K_e$ is $\sum_{a \in A,\, e \in E_a \cap E_{H,n}} w_{e,a}$, i.e., the sum of the expectation of each successful activation of $v$ from hyperedge $e$, where $E_{H,n} = \{(u_1, u_2, \ldots, u_n; v) \in E_H\}$ contains all size-$n$ hyperedges.

To address the data sparsity problem, we leverage information from similar hyperedges (described later). Therefore, our framework includes an additional step to smooth the results of M-Step. Kernel smoothing is employed in S-Step.

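In summary, we have the following steps:

E-Step:

$$E[K_e] = \sum_{a \in A,\, e \in E_a \cap E_{H,n}} w_{e,a}, \qquad w_{e,a} = \frac{\hat{p}_e}{1 - \prod_{e' \in E_a} (1 - \hat{p}_{e'})}.$$

M-Step:

$$\hat{p}_e = \frac{\sum_{a \in A,\, e \in E_a \cap E_{H,n}} w_{e,a}}{N_U}, \qquad \lambda_A(e) = \sum_{a \in A,\, e \in E_a \cap E_{H,n}} w_{e,a}, \qquad \lambda_T(U) = N_U.$$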

S-Step: To address the data sparsity problem, we leverage information from similar hyperedges (described later). Therefore, in addition to E-Step and M-Step, EMS includes S-Step, which smooths the results of M-Step. Kernel smoothing is employed in S-Step as follows:

$$\lambda_A(e) = \sum_{a \in A,\, e' \in E_a \cap E_{H,n}} w_{e',a}\, L_h\left(F(e) - F(e')\right),$$

$$\lambda_T(U) = \sum_{U' \subseteq V_{SI}} N_{U'}\, L_h\left(F(U) - F(U')\right),$$

where $L_h$ is a kernel function with bandwidth $h$, and $F$ is the mapping function of hyperedges, i.e., $F(e)$ maps a hyperedge $e$ to a vector. The details of dimension reduction for calculating $F$ to efficiently map hyperedges into Euclidean space are shown in the next subsection. If the hyperedges $e$ and $e'$ are similar, the distance between the vectors $F(e)$ and $F(e')$ is small. Moreover, a kernel function $L_h(x)$ is a positive function symmetric at zero which decreases when $|x|$ increases, and the bandwidth $h$ controls the extent of auxiliary information taken from similar hyperedges (footnote 12). Intuitively, kernel smoothing can identify the correlation of $p_{e_1}$ with $e_1 = U_1 \to v_1$ and $p_{e_2}$ with $e_2 = U_2 \to v_2$ for nearby $v_1$ and $v_2$ and similar $U_1$ and $U_2$.
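The S-Step can be illustrated with a small sketch. It assumes a Gaussian kernel (mentioned in footnote 12 as a common choice), treats the feature vectors $F(\cdot)$ as given, and smooths both intensities over the same hypothetical set of hyperedges; the function names and toy numbers are illustrative only.

```python
from math import exp

def gaussian_kernel(distance_sq, h):
    """Unnormalized Gaussian kernel evaluated on a squared distance."""
    return exp(-distance_sq / (2.0 * h * h))

def smooth(values, features, h):
    """Kernel-weighted sum of values[e2] over all e2, weighted by L_h(F(e) - F(e2))."""
    smoothed = {}
    for e, f_e in features.items():
        total = 0.0
        for e2, f_e2 in features.items():
            d2 = sum((a - b) ** 2 for a, b in zip(f_e, f_e2))
            total += values[e2] * gaussian_kernel(d2, h)
        smoothed[e] = total
    return smoothed

# Toy data (hypothetical): e1 and e2 are close in feature space, e3 is far away.
features = {"e1": (0.0, 0.0), "e2": (0.1, 0.0), "e3": (10.0, 0.0)}
lam_a = {"e1": 0.0, "e2": 2.0, "e3": 1.0}   # expected successes from the M-step
lam_t = {"e1": 1.0, "e2": 4.0, "e3": 4.0}   # observed trials

sm_a, sm_t = smooth(lam_a, features, 1.0), smooth(lam_t, features, 1.0)
p_smoothed = {e: sm_a[e] / sm_t[e] for e in features}
print(p_smoothed)
```

With a wide bandwidth, the sparsely observed `e1` borrows evidence from its neighbor `e2`; as `h` shrinks toward zero, the kernel weights of other hyperedges vanish and the estimate falls back to the unsmoothed ratio.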

5.3 Dimension Reduction 

To facilitate the computation of the smoothing function in the EMS algorithm, we exploit a dimension reduction technique [?], [?] to map a graph into Euclidean space as a set of vectors. Specifically, given $N$ users, let $Z \in \mathbb{R}^{N \times N}$ denote the projection matrix, where $z_i$ is the $i$-th row of $Z$ and is the projection of vertex $v_i$. We derive $Z$ as follows:

$$Z = \arg\min_{Z^{T} I Z = c} Z^{T} L Z, \qquad L = D - W, \qquad D_{ii} = \sum_j W_{ij},$$

where $D$ is a diagonal matrix, $I$ is the identity matrix, and $L$ is the Laplacian matrix of the distance matrix $W \in \mathbb{R}^{N \times N}$. This optimization problem attempts to preserve the distance between nodes. However, the constraint $Z^{T} I Z = c$ restricts that only $c$ columns of $Z$ can be non-zero vectors. Therefore, the objective function reduces the dimension of $z_i$ from $N$ to $c$ while maintaining the local structure as much as possible.

Suppose we solve this generalized eigenproblem for the first $c$ solutions, and the $i$-th component of the $j$-th eigenvector $z_j$ is denoted as $z_{j,i}$. The projection of the $i$-th node $v_i$ has a $K$-dimensional representation $f(v_i) = (z_{1,i}, z_{2,i}, \ldots, z_{K,i})$. In our paper, we employ only one of the most widely used nonlinear dimension reduction techniques, ISOMAP [21]. The distance matrix is $W = -HSH/2$, where $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}$ ($I$ is the identity matrix and $\mathbf{1}$ is the vector of all ones) and $S_{ij}$ is the distance between two nodes in the graph. For our SIG, we first project the users, whose graph is $G$, and then the items, whose graph is set to be the complete graph. Other methods sharing the above optimization formulation can be used as substitutes without much effort.
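The centering operation $W = -HSH/2$ can be written out entry by entry, since $(HSH)_{ij}$ equals $S_{ij}$ minus the mean of row $i$, minus the mean of column $j$, plus the grand mean. The sketch below is a pure-Python illustration with a hypothetical function name; it assumes $S$ holds squared pairwise distances, which is the usual input of classical multidimensional scaling, and it stops before the eigendecomposition that yields the final coordinates.

```python
def double_center(S):
    """Return W = -H S H / 2 with H = I - (1/N) 1 1^T, for a square matrix S."""
    n = len(S)
    row_mean = [sum(row) / n for row in S]
    col_mean = [sum(S[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (S[i][j] - row_mean[i] - col_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]

# Three nodes on a path, placed at coordinates -1, 0, 1 (illustrative data).
S = [[0, 1, 4], [1, 0, 1], [4, 1, 0]]
W = double_center(S)
for row in W:
    print([round(value, 6) for value in row])
```

For this toy input, `W` recovers the matrix of inner products of the centered coordinates, whose leading eigenvectors give the low-dimensional embedding.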

By employing the above graph embedding approach, we project the graph into Euclidean space while preserving the distance between the nodes locally. Therefore, the customers who are socially near and the similar commodity items can be extracted efficiently. After the dimension reduction procedure, each node has a $K$-dimensional representation, so each hyperedge $e$ of size $n$, comprising $n$ source nodes and 1 destination node, can be mapped to a vector in the space $\mathbb{R}^{(n+1)K}$.

6. EVALUATIONS 

We conduct comprehensive experiments to evaluate the proposed SIG model, learning framework and seed selection algorithms. In Section 6.1, we discuss the data preparation for our evaluation. In Section 6.2, we compare the predictive power of the SIG model against two baseline models: i) the independent cascade (IC) model learned by implementing [?] and ii) the generalized threshold (GT) model learned by [10] (footnote 13). In addition, we evaluate the learning framework based on the proposed EM and EMS algorithms. Next, in Section 6.3, we evaluate the proposed HAG algorithm for SIMP in comparison to a number of baseline strategies, including random, single node selection, social, and item approaches. Finally, in Section 6.4, we evaluate alternative approaches for diffusion processing, which is essential and critical for HAG, based on SIG-index, Monte Carlo simulations and sorting enhancement.

6.1 Data Preparation 

Here, we conduct comprehensive experiments using three real datasets to evaluate the proposed ideas and algorithms. The first dataset comes from Douban [1], a social networking website allowing users to share music and books with friends. Dataset Douban contains 5,520,243 users and 86,343,003 friendship links, together with 7,545,432 (user, music) and 14,050,265 (user, bookmark) pairs, representing the music noted and the bookmarks noted by each user, respectively. We treat those (user, music) and (user, bookmark) pairs as purchase actions. In addition to Douban, we adopt two public datasets, i.e., Gowalla and Epinions. Dataset Gowalla contains 196,591 users, 950,327 links, and 6,442,890 check-ins [7]. Dataset Epinions contains 22,166 users, 335,813 links, 27 categories of items, and 922,267 ratings with timestamps [?]. Notice that we do not have data directly reflecting item inferences in online stores, so we use the purchase logs for learning and evaluations. The experiments are implemented on an HP DL580 server with 4 Intel Xeon E7-4870 2.4 GHz CPUs and 1 TB RAM.

We split each of the three datasets into 5 folds, choose one subsample as training data, and test the models on the remaining subsamples. Specifically, we ignore the cases when the user and her friends did not buy anything. Finally, to evaluate the effectiveness of the proposed SIG model (and the learning approaches), we obtain the purchase actions in the following cases as the ground truth: 1) item inference: a user buys some items within a short period of time; and 2) social influence: a user buys an item after at least one of her friends bought the item. The considered periods of item inference and social influence are set differently according to [14] and [28], respectively. It is worth noting that only the hyperedges with probability larger than a threshold parameter $\theta$ are considered. We empirically tune $\theta$ to obtain the default setting based on the optimal F1-Score. Similarly, the threshold parameter $\theta$ for the GT model is obtained empirically. The reported precision, recall, and F1 are the averages of these tests. Since both SIGs and the independent cascade model require successive data, we split the datasets into continuous subsamples.

Table 1: Comparison of precision, recall, and F1 for three models on Douban, Gowalla, Epinions

Dataset   Model  Precision  Recall    F1-Score
Douban    GT     0.420916   0.683275  0.520927
Douban    IC     0.448542   0.838615  0.584473
Douban    SIG    0.869348   0.614971  0.761101
Gowalla   GT     0.124253   0.435963  0.171214
Gowalla   IC     0.217694   0.579401  0.323537
Gowalla   SIG    0.553444   0.746408  0.646652
Epinions  GT     0.142565   0.403301  0.189999
Epinions  IC     0.172924   0.799560  0.247951
Epinions  SIG    0.510118   0.775194  0.594529

Figure 8: Comparisons of precision and F1 for various maximum hyperedge sizes and bandwidths h on Epinions.

12 A symmetric Gaussian kernel function is often used [12].
13 http://people.cs.ubc.ca/~welu/downloads.html

6.2 Model Evaluation 

Table 1 presents the precision, recall, and F1 of SIG, IC and GT on Douban, Gowalla, and Epinions. All three models predict most accurately on Douban due to the large sample size. The SIG model significantly outperforms the other two models in precision and F1 on all three datasets, because it takes into account both effects of social influence and item inference, while the baseline models only consider the social influence. The difference in F1 score between SIG and the baselines is more significant on Douban, because it contains more items. Thus, item inference plays a more important role. Also, when the user size increases, SIG is able to extract more social influence information, leading to better performance than the baselines. The offline training time is 1.68, 1.28, and 4.05 hours on Epinions, Gowalla, and Douban, respectively.
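For reference, the three reported metrics can be computed from a set of predicted purchases and a set of ground-truth purchases as sketched below. The prediction rule (predict the destination of every hyperedge whose learned probability exceeds the threshold $\theta$) is a simplifying assumption for illustration, and all names and toy values are hypothetical.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 of a predicted set against the ground truth."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def predict(hyperedge_probs, theta):
    """Destinations of hyperedges whose learned probability exceeds theta."""
    return {dest for (sources, dest), p in hyperedge_probs.items() if p > theta}

# Toy hyperedges: (source tuple, destination) -> learned probability.
hyperedge_probs = {
    (("a",), "x"): 0.9,
    (("b",), "y"): 0.2,
    (("a", "b"), "z"): 0.6,
}
predicted = predict(hyperedge_probs, theta=0.5)
metrics = precision_recall_f1(predicted, actual={"x", "y"})
print(sorted(predicted), metrics)
```

Raising `theta` keeps only high-probability hyperedges, which typically trades recall for precision; this is why the threshold is tuned for the best F1.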

To evaluate the approaches adopted to learn the activation probabilities of hyperedges for construction of the SIG, Fig. 8 compares the precision and F1 of the EMS and EM algorithms on Epinions (results on other datasets are consistent and thus not shown due to space limitation). Note that EM is a special case of EMS (with the smoothing parameter h = 0, i.e., no similar hyperedge used for smoothing). EMS outperforms EM on both precision and F1-score in all settings of p (the maximum size of hyperedges) and h tested. Moreover, the precision and F1-score both increase with h, as a larger h overcomes data sparsity significantly. As p increases, more combinations of social influence and item inference can be captured. Therefore, the experiments show that a higher p improves the F1-score without degrading the precision. This shows that the learned hyperedges are effective for predicting triggered purchases.


6.3 Algorithm Effectiveness and Efficiency 

We evaluate HAG proposed for SIMP, by selecting the top 10 items as the marketing items to measure their total adoption, in comparison with a number of baselines: 1) Random approach (RAN). It randomly selects $k$ nodes as seeds. Note that the reported values are the average of 50 random seed sets. 2) Single node selection approach (SNS). It selects the node with the largest increment of the total adoption in each iteration, until $k$ seeds are selected, which is widely employed in conventional seed selection problems [?]. 3) Social approach (SOC). It only considers the social influence in selecting the $k$ seeds. The hyperedges with nodes from different products are eliminated in the seed selection process, but they are restored for the calculation of the final total adoption. 4) Item approach (IOC). The seed set is the same as HAG, but the prediction is based on item inference only. For each seed set selected by the above approaches, the diffusion process is simulated 300 times. The average in-degree of nodes learned from the three datasets is 39.56 for Douban, 9.90 for Gowalla, and 14.04 for Epinions. In this section, we evaluate HAG by varying the number of seeds (i.e., $k$) using two metrics: 1) total adoption, and 2) running time.

To understand the effectiveness, we first compare all those approaches with the optimal solution (denoted as OPT) in a small subgraph, Sample, sampled from the SIG of Douban with 50 nodes and 58 hyperedges. Figure 9(a) displays the total adoption obtained by different approaches. As shown, HAG performs much better than the baselines and achieves total adoption comparable with OPT (the difference decreases with increased $k$). Note that OPT is not scalable, as shown in Figure 9(b), since it needs to examine all combinations of $k$ nodes. Also, OPT takes more than 1 day for selecting 6 seeds in Sample. Thus, for the rest of the experiments, we exclude OPT.

Figures 10(a)-(c) compare the total adoptions of different approaches in the SIG learnt from real networks. They all grow as $k$ increases, since a larger $k$ increases the chance for seeds to influence others to adopt items. Figures 10(a)-(c) show that HAG outperforms all the other baselines for any $k$ in the SIG model. Among them, SOC fails to find good solutions since item inference is not examined during seed selection. IOC performs poorly without considering social influence. SNS only includes one seed at a time without considering the combination of nodes that may activate many other nodes via hyperedges.
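The effect of seed combinations can be reproduced on a toy hypergraph with a small Monte Carlo estimator of total adoption. This is a sketch under stated assumptions, not the paper's implementation: each hyperedge is given a single activation attempt once all of its sources are active (an independent-cascade-style convention), and the function names, the default of 300 runs per seed set, and the toy hyperedges are chosen for illustration.

```python
import random

def simulate_once(seeds, hyperedges, rng):
    """One diffusion run; returns the number of active nodes at the end."""
    active = set(seeds)
    tried = set()
    progressed = True
    while progressed:
        progressed = False
        for index, (sources, dest, prob) in enumerate(hyperedges):
            if index in tried or not sources <= active:
                continue
            tried.add(index)  # a hyperedge gets one attempt (assumption)
            if dest not in active and rng.random() < prob:
                active.add(dest)
                progressed = True
    return len(active)

def estimate_total_adoption(seeds, hyperedges, runs=300, seed=0):
    """Average the final number of active nodes over repeated simulations."""
    rng = random.Random(seed)
    return sum(simulate_once(seeds, hyperedges, rng) for _ in range(runs)) / runs

# Toy hyperedges: (source set, destination, probability).
example_edges = [
    (frozenset({"a", "b"}), "c", 1.0),
    (frozenset({"c"}), "d", 1.0),
    (frozenset({"a"}), "e", 0.0),
]
for seed_set in ({"a"}, {"b"}, {"a", "b"}):
    print(sorted(seed_set), estimate_total_adoption(seed_set, example_edges))
```

Seeding `a` or `b` alone adds nothing beyond the seed itself, so a selector that scores one node at a time sees no gain from either, whereas seeding both triggers the hyperedge and the cascade behind it.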

Figure 10(d) reports the running time of those approaches. Note that the trends on Gowalla and Epinions are similar to those on Douban. Thus, we only report the running time on Douban due to the space constraint. Taking the source combinations into account, HAG examines source combinations of hyperedges in $E_H$ and obtains a better solution by spending more time, since the number of hyperedges is often much higher than the number of nodes n

6.4 Online Diffusion Processing 

Diffusion processing is an essential operation in HAG. We evaluate the efficiency of diffusion processing based on SIG-index (denoted as SX), in terms of the running time, in comparison with that based on the original Monte Carlo simulation (denoted as MC) and the sorting enhancement (denoted as SORTING), which accesses the hyperedges in descending order of their weights. Figure 11 plots the running time of SX, SORTING, and MC under various $k$ using Douban, Gowalla, and Epinions. For each $k$, the average running times of 50 randomly selected seed sets for SX, SORTING, and MC are reported. The diffusion process is simulated 300 times for each seed set. As Figure 11 depicts, the running time for all three approaches grows as $k$ increases, because a larger number of seeds tends to increase the chance for other nodes to be activated. Thus, it needs more time to diffuse. Notice that SX takes much less time than SORTING and MC, because SX avoids accessing hyperedges with no source nodes newly activated while calculating the activation probability. Moreover, the SIG-index is updated dynamically according to the activated
nodes in the diffusion process. Also note that the improvement of SORTING over MC in Douban is more significant than that in Gowalla and Epinions, because the average in-degree of nodes is much larger in Douban. Thus, activating a destination at an early stage can effectively avoid processing many hyperedges later.


14 To further reduce the running time, we eliminate the source 
combinations with little gain in previous iterations. 


7. CONCLUSION 

In this paper, we argue that existing techniques for item inference recommendation and seed selection need to jointly take social influence and item inference into consideration. We propose Social Item Graph (SIG) for capturing purchase actions and predicting potential purchase actions. We propose an effective machine learning approach to construct an SIG from purchase action logs and learn hyperedge weights. We also develop efficient algorithms to solve the new and challenging Social Item Maximization Problem (SIMP), which effectively select seeds for marketing. Experimental results demonstrate the superiority of the SIG model over existing models and the effectiveness and efficiency of the proposed algorithms for processing SIMP. We also plan to further accelerate the diffusion process by indexing additional information on SIG-index.